 Concord


Evaluating Long-Context Reasoning in LLM-Based WebAgents

Chung, Andy, Zhang, Yichi, Lin, Kaixiang, Rawal, Aditya, Gao, Qiaozi, Chai, Joyce

arXiv.org Artificial Intelligence

As large language model (LLM)-based agents become increasingly integrated into daily digital interactions, their ability to reason across long interaction histories becomes crucial for providing personalized and contextually aware assistance. However, the performance of these agents in long context scenarios, particularly for action-taking WebAgents operating in realistic web environments, remains largely unexplored. This paper introduces a benchmark for evaluating long context reasoning capabilities of WebAgents through sequentially dependent subtasks that require retrieval and application of information from extended interaction histories. We develop a novel evaluation framework that simulates multi-session user interactions by injecting irrelevant task trajectories between dependent subtasks, creating contexts ranging from 25,000 to 150,000 tokens. Through extensive evaluation of four popular models, Claude-3.7, GPT-4.1, Llama 4, and o4-mini, we observe a dramatic performance degradation as context length increases, with success rates dropping from 40-50% in baseline conditions to less than 10% in long context scenarios. Our detailed error analysis reveals that agents primarily fail due to getting stuck in loops and losing track of original task objectives. We further propose an implicit RAG approach that provides modest improvements by generating task-relevant summaries, though fundamental limitations in long context reasoning persist. These findings highlight critical challenges for deploying WebAgents in realistic, long-term user interaction scenarios and provide insights for developing more robust agent architectures capable of maintaining coherent task execution across extended contexts.
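
The context-construction step described in the abstract (irrelevant trajectories injected between two dependent subtasks until a target length is reached) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' framework: the function name `build_long_context`, the whitespace-based token count, and the toy trajectories are all hypothetical.

```python
def count_tokens(text):
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def build_long_context(first_subtask, distractors, second_subtask, token_budget):
    """Place distractor trajectories between two dependent subtasks.

    Distractors are added in order until the next one would exceed the budget.
    Returns the ordered history and the number of tokens it uses.
    """
    history = [first_subtask]
    used = count_tokens(first_subtask) + count_tokens(second_subtask)
    for trajectory in distractors:
        cost = count_tokens(trajectory)
        if used + cost > token_budget:
            break
        history.append(trajectory)
        used += cost
    history.append(second_subtask)
    return history, used


# Toy usage with hypothetical trajectories.
history, used = build_long_context(
    first_subtask="a b c",
    distractors=["x y z", "p q", "r s t u"],
    second_subtask="d e",
    token_budget=10,
)
```

In this toy run the first two distractors fit within the budget and the third is dropped, so the dependent subtasks end up separated by unrelated history.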


PIFON-EPT: MR-Based Electrical Property Tomography Using Physics-Informed Fourier Networks

Yu, Xinling, Serrallés, José E. C., Giannakopoulos, Ilias I., Liu, Ziyue, Daniel, Luca, Lattanzi, Riccardo, Zhang, Zheng

arXiv.org Artificial Intelligence

We propose Physics-Informed Fourier Networks for Electrical Properties (EP) Tomography (PIFON-EPT), a novel deep learning-based method for EP reconstruction using noisy and/or incomplete magnetic resonance (MR) measurements. Our approach leverages the Helmholtz equation to constrain two networks, responsible for the denoising and completion of the transmit fields, and the estimation of the object's EP, respectively. We embed a random Fourier features mapping into our networks to enable efficient learning of high-frequency details encoded in the transmit fields. We demonstrated the efficacy of PIFON-EPT through several simulated experiments at 3 and 7 tesla (T) MR imaging, and showed that our method can reconstruct physically consistent EP and transmit fields. Specifically, when only $20\%$ of the noisy measured fields were used as inputs, PIFON-EPT reconstructed the EP of a phantom with $\leq 5\%$ error, and denoised and completed the measurements with $\leq 1\%$ error. Additionally, we adapted PIFON-EPT to solve the generalized Helmholtz equation that accounts for gradients of EP between inhomogeneities. This yielded improved results at interfaces between different materials without explicit knowledge of boundary conditions. PIFON-EPT is the first method that can simultaneously reconstruct EP and transmit fields from incomplete noisy MR measurements, providing new opportunities for EPT research.
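
For readers unfamiliar with the random Fourier features mapping mentioned above, a minimal sketch follows. It assumes the commonly used form gamma(x) = [cos(2*pi*Bx), sin(2*pi*Bx)] with the entries of B drawn from a zero-mean Gaussian; the abstract does not specify the exact mapping, and the name `make_fourier_mapping` and the `sigma` value are hypothetical.

```python
import math
import random


def make_fourier_mapping(in_dim, num_features, sigma, seed=0):
    """Return a function mapping a point x to random Fourier features.

    Assumption: B has shape (num_features, in_dim) with entries ~ N(0, sigma^2).
    """
    rng = random.Random(seed)
    B = [[rng.gauss(0.0, sigma) for _ in range(in_dim)]
         for _ in range(num_features)]

    def mapping(x):
        # Project x onto each random direction, then take cosine and sine.
        proj = [2.0 * math.pi * sum(b * xi for b, xi in zip(row, x))
                for row in B]
        return [math.cos(p) for p in proj] + [math.sin(p) for p in proj]

    return mapping


# Toy usage: embed a 3-D coordinate into 8 features (4 cosines, 4 sines).
gamma = make_fourier_mapping(in_dim=3, num_features=4, sigma=1.0)
features = gamma([0.1, 0.2, 0.3])
```

A larger `sigma` spreads the random frequencies more widely, which is the usual lever for letting a coordinate-based network fit higher-frequency detail.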


A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

Gur, Izzeddin, Furuta, Hiroki, Huang, Austin, Safdari, Mustafa, Matsuo, Yutaka, Eck, Douglas, Faust, Aleksandra

arXiv.org Artificial Intelligence

Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents using local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7% higher success rate than the prior method on MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation.


Creative AI, FinOps among hot developer trends of 2023

#artificialintelligence

A handful of important trends will transform the software developer experience in 2023, as enterprises consider more self-hosting, observe more SaaS consolidations and see an upswing of interest in creative AI. Also, as AI enters the creativity realm, it threatens to upend the future of app dev. And OpenAI's ChatGPT, released in November, takes code completion beyond line suggestions -- in addition to writing complete web pages and simple applications, it can generate new programming languages. For developers, the 2022 job market started strong, but by December, they saw storm clouds as layoffs hit the tech sector. Experts felt vibes of the early 2000s recession and the pandemic's early days.


High School Sophomore Arrested For Hacking Computer System, Changing Grades Of Other Students

International Business Times

A Northern California teen was arrested Wednesday for hacking a school district's computer system and changing the grades of up to 15 students. Authorities said they arrested David Rotaro, a sophomore at Ygnacio Valley High School in Concord, California, for infiltrating the school district's computer system. Rotaro, 16, said it was like "stealing candy from a baby," according to KGO-TV, an ABC affiliate in San Francisco. It took him five minutes to design a "phishing email" that he sent out to swipe login information from school faculty. Authorities didn't release Rotaro's name; however, he confessed to having committed the crime during an interview with KGO-TV.


Where are self-driving cars being tested?

FOX News

An Arizona woman was killed after being struck by a self-driving Uber vehicle, an incident believed to be the first of its kind. But Uber is not the only company that has experienced accidents with driverless cars. Companies like Google, Tesla and General Motors also join the list. An Arizona woman was killed after being struck by a self-driving Uber vehicle this week - prompting the company to suspend all testing of self-driving vehicles in cities across the country. The Uber was in autonomous mode at the time of the collision in Tempe, and there was a vehicle operator behind the wheel, police said.


Uber's Robo-Truck, McLaren's Senna Supercar, and More Cars News This Week

WIRED

If the phrase "autonomous vehicle" makes you think of some four-wheeled pod tootling around the city, you need to think bigger. For all the talk of robo-taxis, the smart money says that when this tech comes for our roads, it'll start on the highway. And if you're looking for proof, grab your sunglasses, a trucker hat, and a ticket to Arizona or Florida--the testing grounds of choice for the companies teaching trucks to drive themselves. This week, we have news of Uber testing in the Copper State and startup Starsky Robotics sending a truck down a Florida highway, all by itself. Meanwhile, the titans of the auto industry met at the Geneva Motor Show, where the talk centered on supercars--and how to take down Elon Musk.


Real-Time Energy Disaggregation of a Distribution Feeder's Demand Using Online Learning

Ledva, Gregory S., Balzano, Laura, Mathieu, Johanna L.

arXiv.org Machine Learning

Though distribution system operators have been adding more sensors to their networks, they still often lack an accurate real-time picture of the behavior of distributed energy resources such as demand responsive electric loads and residential solar generation. Such information could improve system reliability, economic efficiency, and environmental impact. Rather than installing additional, costly sensing and communication infrastructure to obtain additional real-time information, it may be possible to use existing sensing capabilities and leverage knowledge about the system to reduce the need for new infrastructure. In this paper, we disaggregate a distribution feeder's demand measurements into: 1) the demand of a population of air conditioners, and 2) the demand of the remaining loads connected to the feeder. We use an online learning algorithm, Dynamic Fixed Share (DFS), that uses the real-time distribution feeder measurements as well as models generated from historical building- and device-level data. We develop two implementations of the algorithm and conduct case studies using real demand data from households and commercial buildings to investigate the effectiveness of the algorithm. The case studies demonstrate that DFS can effectively perform online disaggregation and the choice and construction of models included in the algorithm affects its accuracy, which is comparable to that of a set of Kalman filters.
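
To give a feel for the kind of online model-weighting that the abstract refers to, here is a sketch of the classic Fixed Share update (exponential weighting followed by mixing with the uniform distribution). It is not the paper's Dynamic Fixed Share algorithm, whose details are not given in the abstract; the squared-error loss, the parameters `eta` and `alpha`, and the function names are assumptions chosen for illustration.

```python
import math


def fixed_share_step(weights, predictions, observed, eta=1.0, alpha=0.05):
    """One round of a Fixed Share style update over a set of models.

    weights:     current weight of each model (non-negative, summing to 1)
    predictions: each model's prediction for this time step
    observed:    the measurement revealed after predicting
    """
    # Assumption: squared error as the per-model loss.
    losses = [(p - observed) ** 2 for p in predictions]
    # Exponential-weights update, then renormalize.
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    posterior = [s / total for s in scaled]
    # Share step: mix with the uniform distribution so no model's weight
    # collapses to zero and the best model can change over time.
    n = len(weights)
    return [(1.0 - alpha) * v + alpha / n for v in posterior]


def combine(weights, predictions):
    # Weighted average of the model predictions.
    return sum(w * p for w, p in zip(weights, predictions))


# Toy usage with two hypothetical models and one observation.
weights = [0.5, 0.5]
predictions = [1.0, 3.0]
estimate = combine(weights, predictions)
new_weights = fixed_share_step(weights, predictions, observed=1.0)
```

After this single round the model that predicted the observation exactly carries most of the weight, while the share step keeps the other model's weight above zero.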


WWII bombers once built on new Michigan driverless car test site

USATODAY - Tech Top Stories

The ex-bomber plant and home of Rosie the Riveter will transform this year into an autonomous vehicle technology test site. It once housed one of the largest factories in the world, pumping out B24 bombers to help America and her allies win World War II, and later transmissions when it was owned by General Motors. The former Willow Run bomber plant in Ypsilanti Township is mostly a memory now, demolished following GM's 2009 bankruptcy, except for a piece that houses the Yankee Air Museum. Pictured in January 2017: land at the former 335-acre Willow Run site in Ypsilanti Township, where the American Center for Mobility is located, that will be used for testing autonomous vehicles.


Honda's Self-Driving Car Goes For A Test Run

Forbes - Tech

On an old, decommissioned naval base in Concord, California, Honda showed off its latest advancements in autonomous car technology.